Skip to content

Optimize simple time ranged search queries #5759

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tontinton
Copy link
Contributor

@tontinton tontinton commented Apr 19, 2025

When the search request contains a time range, we aborted the optimization of converting unneeded split searches into count queries.

Removed the TODO that was in the code for this.

Split the PR into 2: #5758.

@tontinton tontinton changed the title Optimize time ranged search queries Optimize simple time ranged search queries Apr 19, 2025
@tontinton tontinton force-pushed the optimize-timestamp-range-simple-search branch 2 times, most recently from 804be29 to 61c3b74 Compare April 19, 2025 19:51
Copy link
Collaborator

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR. The way the logic is split here between optimize_split_order and optimize makes it very hard to proof read. Currently it's already a bit confusing because the optimize() logic for each variant depends on the sort order which is different for each variant. It would be good if we could avoid tightening the coupling between the two match statements even further by making the sort orders more complex.

Comment on lines 1076 to 1082
let min_required_splits = splits
.iter()
// splits are sorted by whether they are contained in the request time range
.filter(|split| Self::is_contained(split, &request))
.map(|split| split.num_docs)
// computing the partial sum
.scan(0u64, |partial_sum: &mut u64, num_docs_in_split: u64| {
Copy link
Collaborator

@rdettai rdettai Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this works? Say you have 5 splits:

  • 1 is contained in the request
  • 4 are only overlapping the request

In that case min_required_splits would be 1 here, but there might be a biggest_end_timestamp or smallest_start_timestamp in any of the other splits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts were that there's a

.take_while(|partial_sum| *partial_sum < num_requested_docs)

so min_required_splits would not get to be 1, but now I see that I should validate that we actually reached this condition, and only if we reached it to set min_required_splits and optimize, otherwise return early.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a quick fix, I need to test it, but I think it answers the problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know of std::ops::ControlFlow 😄. Nevertheless, this reaches the limit when iterators stop being practical. A for loop is much more readable here.

@tontinton tontinton force-pushed the optimize-timestamp-range-simple-search branch from 61c3b74 to 851c15c Compare April 22, 2025 20:31
@tontinton
Copy link
Contributor Author

Thanks for your PR. The way the logic is split here between optimize_split_order and optimize makes it very hard to proof read. Currently it's already a bit confusing because the optimize() logic for each variant depends on the sort order which is different for each variant. It would be good if we could avoid tightening the coupling between the two match statements even further by making the sort orders more complex.

It is a bit confusing, I'll try to think of a better way to do this when I have some more time.

@tontinton tontinton force-pushed the optimize-timestamp-range-simple-search branch 2 times, most recently from 103b8d9 to 69e85ff Compare April 22, 2025 20:44
@rdettai
Copy link
Collaborator

rdettai commented Apr 23, 2025

I would try refactoring this into:

fn optimize()
  match self {
    CandSplitDoBetter::SplitIdHigher(_) => optimize_split_id_higher()
    CandSplitDoBetter::SplitTimestampLower(_) => optimize_split_timesamp_lower()
    ...
    CanSplitDoBetter::Uninformative => {}
  }

where optimize_split_xxx:

  • sorts
  • returns early if !is_simple_all_query
  • applies its variant specific optimization

This would regroup the optimizations logics with the sorts they depend on to be correct. (To help the review, ideally, re-implement the current logic in 1 commit, and in a separate commit add you logic extension.)

Thanks!

@tontinton tontinton force-pushed the optimize-timestamp-range-simple-search branch 3 times, most recently from ed6cd7b to 3055965 Compare April 29, 2025 18:33
@tontinton tontinton requested a review from rdettai April 29, 2025 18:36
@tontinton tontinton force-pushed the optimize-timestamp-range-simple-search branch 3 times, most recently from cb52ca6 to 6b427b5 Compare April 29, 2025 18:43
Copy link
Collaborator

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is a bit more verbose, but I find it easier to read!

Check the comment marked "MORE IMPORTANT" first 😉

Comment on lines 1051 to 1055
if partial_sum >= num_requested_docs {
return Some(min_required_splits + 1);
}
CanSplitDoBetter::SplitTimestampLower(_) => {
splits.sort_unstable_by_key(|split| split.timestamp_start())

min_required_splits += 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, this is more readable

min_required_splits += 1;
if partial_sum >= num_requested_docs {
    return Some(min_required_splits);
}

Comment on lines +1035 to +1039
fn get_min_required_splits(
splits: &[SplitIdAndFooterOffsets],
request: &SearchRequest,
) -> Option<usize> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, this is now really straightforward to understand and verify 😄

Comment on lines 1096 to 1095
let contained = Self::is_split_contained_in_search_time_range(split, &request);
(!contained, std::cmp::Reverse(split.timestamp_end()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is really hard to know here if contained splits come first or second (the ordering of boolean is not obvious, the ! negation doesn't help 😄 ). Could we maybe explicitly map true/false to something that is more clearly ordered:

let contained_first = match Self::is_split_contained_in_search_time_range(split, &request) {
  true => 0u8,
  false => 1u8,
};
(contained_first, std::cmp::Reverse(split.timestamp_end()))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MORE IMPORTANT: now that I think about it, I thing it's not good idea to have contained splits processed first in the general case (!is_simple_all_query or get_min_required_splits().is_none()`). You miss an optimization opportunity, because once you'll have processed them, you will still need to check the splits overlapping the higher end of the range for better matches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, we can maybe instead of sorting by contained, keep the same sort mechanism as it was.

but still keep the new implementation of get_min_required_splits, meaning we won't optimize the first few splits which might not be contained in the search query's range.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the change to not sort, tell me what you think.

Comment on lines +1025 to +1029
splits
.into_iter()
.map(|split| (split, (*request).clone()))
.collect::<Vec<_>>()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the comment:

// TODO: we maybe want here some deduplication + Cow logic

Comment on lines 1061 to 1064
fn optimize_split_id_higher(
&self,
request: Arc<SearchRequest>,
mut splits: Vec<SplitIdAndFooterOffsets>,
) -> Result<Vec<(SplitIdAndFooterOffsets, SearchRequest)>, SearchError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could actually call to_splits_with_request before (inside optimize) and simplify this signature to something like:

Suggested change
fn optimize_split_id_higher(
&self,
request: Arc<SearchRequest>,
mut splits: Vec<SplitIdAndFooterOffsets>,
) -> Result<Vec<(SplitIdAndFooterOffsets, SearchRequest)>, SearchError> {
fn optimize_split_id_higher(
&mut [(SplitIdAndFooterOffsets, SearchRequest)]
) -> Result<(), SearchError> {

This would prevent quite a lot of repetition and is also more explicit because it shows that we don't drop splits (note that &self is also dropped because not used)
(applies to all optimzie_split_xxx)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified to remove &self, tried what you suggested, but then simple stuff like calling is_simple_all_query that accepts a &SearchRequest became ugly, WDYT?

@tontinton tontinton force-pushed the optimize-timestamp-range-simple-search branch 2 times, most recently from 4758505 to b7ea08e Compare April 30, 2025 20:55
@tontinton tontinton force-pushed the optimize-timestamp-range-simple-search branch from b7ea08e to f066b6c Compare April 30, 2025 21:02
When the search request contains a time range, we aborted the
optimization of converting unneeded split searches into count queries.
@tontinton tontinton force-pushed the optimize-timestamp-range-simple-search branch from f066b6c to 110b53a Compare April 30, 2025 21:03
@tontinton tontinton requested a review from rdettai April 30, 2025 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants